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Disease occurs due to aberrant expression of genes and modulation of the biological pathways along which they 
lie. Inference of activated gene pathways, using gene expression data during disease progression, is an important 
problem. In this work, we have developed a generalizable framework for the identification of interacting pathways while 
incorporating biological realism, using functional data analysis and manifold embedding techniques. Additionally, we 
have also developed a new method to query for the differential co-ordinated activity of any desired pathway during 
disease progression. The methods developed in this work can be generalized to any conditions of interest. 
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1. INTRODUCTION 

A gene is a segment of the genome (DNA) that codes 
for protein. Proteins are the biological catalysts of 
function and mediate the kinetics and timing of any 
biological process. The body is composed of several 
tissues and each tissue has its own individual lineage 
of cells. The set of proteins that are active in each cell 
are different depending on cell context and the under- 
lying function of the cell. Since proteins are encoded 
by the gene(s), it is believed that biological processes 
are orchestrated by a precise spatio-temporal expres- 
sion of genes. Gene activities in the cell are organized 
along pathways, which might either have signaling 
activity or regulatory role in determining cell behav- 
ior. Thus even though there is a heterogeneous pop- 
ulation of cells in the body, each set of cells has a 
precise time and location of gene/pathway expres- 
sion corresponding to required protein activity. 

Systemic disease can occur due to the mis- 
expression of genes in various tissues [as an example, 
the GataS gene is mutated in HDR (hypoparathy- 
roidism, deafness, renal dysplasia)]. A lot of studies 
have studied the link between the genome and the 
phenome - i.e. between gene expression and physical 
characteristic. Disease (and its symptoms) is a physi- 
cal characteristic which arises from mis-expression of 
the underlying genome. Also, very rarely is disease 



due to the aberrant expression of a single gene - they 
mostly arise due to mis-expression of several genes at 
once (i.e. gene sets). Identifying these set of aber- 
rant genes (or pathways) is an important problem 
because of the immense therapeutic potential. There 
is an ongoing effort to find inhibitors that can tar- 
get disease-implicated pathways. For example, two 
well-known drugs Gleevec and Tarceva target recep- 
tor tyrosine kinase signaling pathways that have a 
reported role in cancer [10]. 

In this work, we develop methods to identify 
pathways (Part /) that have a role in disease progres- 
sion. Specifically, we ask which pathways are poten- 
tially modulated during onset and evolution of im- 
mune response to infection. Additionally, we present 
a framework (Part //) that can query the differential 
activity of 'any' pathway between normal and dis- 
eased states, thereby allowing for the principled se- 
lection of pathway inhibitors to modulate and control 
disease. Using time series expression profiles of gene 
expression, we use functional data analysis (FDA) 
to process, analyze and cluster the data into possible 
pathway components. In addition, we use a manifold 
embedding technique to improve on these results and 
extend this for generalized pathway querying. 



* Corresponding author. 
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2. OUTLINE 

This paper is organized as follows. To interpret 
pathway activity during immune response, section 
3 deals with the examination of the gene expression 
data during pathogen infection, and its representa- 
tion in terms of B-spline basis functions. Section 4 
uses principal component analysis on the functional 
data (fPCA) to discover modes of variation in the 
data. This is followed by clustering genes in fPCA 
space to find genes whose interaction is putatively as- 
sociated with infection. We find that the results are 
not directly relevant biologically and hence, section 
8 develops methods to improve the context of the 
clustering results to obtain more meaningful results. 
This new framework enables the solution of another 
hitherto unexamined problem - querying any arbi- 
trary pathway for co-ordinated differential activity 
between case and condition (section 9). As an exam- 
ple, we examine the activity of the Toll-like receptor 
(TLR) pathway for the pathogen infection data set, 
and demonstrate the general utility of this approach 
in such problems. Section 11 concludes the paper. 

3. DATA EXTRACTION AND 
PRE-PROCESSING 

One of the most common processes involving gene- 
pathway modulation is the systemic response of the 
immune system to an invading pathogen. With the 
advent of whole genome microarrays that can as- 
say the activity of genes over a time course, expres- 
sion profiling of genes during the innate and adap- 
tive immune response has been actively pursued for 
the identification of genes that can be used for di- 
agnostic or therapeutic purposes. For this study, in 
order to find pathways implicated in the immune re- 
sponse to pathogen infection, we use functional ex- 
pression data gathered by the Young group [14] . This 
data profiles the various gene activation programs 
initiated in macrophage cells on exposure to various 
pathogens such as tuberculosis, e.coli and staphylo- 
coccus aureus. There is also a set of control treat- 
ments in which latex beads coated without bacteria 
are presented to the macrophage cell population. In 
this study, we consider the differential pathway ac- 
tivity between tuberculosis and control conditions. 
The methods developed in our approach are however. 



more general and can be applied for any number of 
interesting conditions. 

The dataset contains expression values at 8 
time points for 168 unique macrophage genes. 
These correspond to gene expression profiling 
0.5, 1, 3, 6, 12, 16, 18, 24 hours after exposure with the 
pathogen (or control). Since macrophages exhibit 
an early innate as well as late adaptive immune re- 
sponse, it is interesting to examine which genes are 
expressed in a certain phase. 

To use the functional data approach on the raw 
data, we start by representing the functional data 
via B-spline basis functions. This choice of basis 
functions is primarily governed by the lack of inher- 
ent periodicity in immune response over the first 24 
hours, as well as the possibility of local structure 
in the time series that are are relevant to analysis. 
Using the "FDA" package in R, we create B-spline 
basis functions {Bk{tj)) of order 3 and J = 46 in- 
ternal knots over the interval [0.5,24]. Under this 
representation, we have, 

= Y^k=i ^kBk{tj) for i = 1, 2, . . . , n with 

n = 168. 

Furthermore, a smoothing operation is im- 
plemented on the data (using generalized cross- 
validation), with A = 0.001. The plots of the func- 
tional data after smoothing are displayed in Fig. 1(a) 
and Fig. 1(b) respectively. 

4. FUNCTIONAL PRINCIPAL 

COMPONENT ANALYSIS (fPCA) 

Functional PC A (fPCA) aims to find a solution 
to the eigenvalue problem [15, 16] : C(j)h = Ab, 
where C = [^27=1 ^hkCi,i/n], (j) = [{Bk.Bm)] and 
b = (6i, 62, . . . , 6/c). The j^^ principal component 
eigenvector hj of Cc/) leads to an estimate ej = 
[5i, ^2, . . . , BK]^hj of the eigenfunction. With this, 
the j^^ principal component score is given by aij = 
(xi^ej). The set of scores ([q^z,i, Q^z,2, • • • , <^i,p] ^ 
W;i = 1,2,..., 168) can then be used for cluster- 
ing [18]. 

To understand the modes underlying disease on- 
set and its response by the immune system, a fPCA 
analysis of the data was done. The first and second 
principal components of the tuberculosis functional 
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data are displayed in Fig. 2 (a) and (b) respectively. 
These harmonics correspond to the components af- 
ter varimax rotation to aid interpretability (15,16). 
The harmonic plots indicate two distinct behaviors 
and are indicative of typical immune response. 



Control (Smoothed) Data 
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(a) Smoothed Control time-series 
Data. 



Tuberculosis (Smoothed) Data 
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(b) Smoothed Tuberculosis time- 
series Data. 

Fig. 1. Smoothed functional data for immune response under 
tuberculosis and control. The x-axis denotes the time points 
in hours. 

Some interesting insights emerge from the plots 
of Fig. 2. The first harmonic corresponds (roughly) 
to the 'late' variation in gene expression (Fig. 2(a), 
accounting for ~ 78% variation) whereas the second 
principal component corresponds to the 'early' vari- 
ation (Fig. 2(b), accounting for ~ 20% variation). 
This is extremely meaningful because the principal 
components correspond to a drastic change in adap- 
tive immune response and is strongly associated with 
biological response to pathogen infection. This is 
also known from biological literature [14, 7]. 

The scores of the functional tuberculosis gene 
data along these first two principal components is 
shown in Fig. 3(a). 



3 

5. MODEL-BASED CLUSTERING 

Having found scores of each of the genes in fPCA 
space, our goal is to now group (cluster) genes with 
similar temporal profiles. In this section, we de- 
rive the parameter update equations for a Mixture 
of Gaussian clustering paradigm [6, 12]. 



PCA function 1 (Percentage of variability 78.6 ) 
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(a) 



PCA function 2 (Percentage of variability 19.2 ) 
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X 



(b) 

Fig. 2. First and Second Functional Principal Components 
for Tuberculosis data. X-axis corresponds to assay time 
points, as in the previous plots. We note that the solid line 
in each case corresponds to the mean function, the (+) line 
corresponds to the mean function added to a multiple of the 
harmonic and the (— ) line corresponds to a subtraction of the 
harmonic function. 

We consider the group of gene expression pro- 
files ^ = {g^-^\g^^\ . . . ,5'^^^}, ah of which share a 
common dynamic. Consider gene profile i, g^'^^ = 
[<^i,i, <^i,2, • • • , a J-dimensional random vector 
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(here J = 2) which fohows a /c-component finite mix- 
ture distribution described by: 



(1) 



m—1 



where ai, . . . ,a/e are the mixing probabihties, each 
(prn is the set of parameters defining the m^^ compo- 
nent, and = {(/)i, . . . , ai, . . . , a/c} is the set of 
complete parameters needed to specify the mixture. 
We have, 



> 0, m = 1, . . . , /c, and ^ = 1 (2) 



m—l 



PCA Scores for Tuberculosis Data 
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(a) Scores of functional data along functional 
principal components, PCl and PC2 
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(b) BIC plot during model-based clustering 

Fig. 3. Embedding of data onto fPCA space and BIC plot to 
find optimal cluster number. 



For a set of n independently and identically dis- 
tributed samples, 



(3) 



the log-likelihood of a /c-component mixture is given 
by: 



logpm=\og'[lp{g^'^\d) (4) 

i=l 

n k 

= H ^05- ^ amV{g^^ \(t>m) (5) 

2=1 m—l 

• Treat the labels, ^ = {z^^\ . . . , z^"^"^}, as- 
sociated with the n samples - as missing 



data. Each label is a binary vector z^*^ 



.(0 



where 



.(0 



1 and 



0, 



for p 7^ m indicates that sample g^'^^ was 
produced by the m^^ component. 

In this setting, the Expectation Maximiza- 
tion algorithm can be used to derive the cluster pa- 
rameter (6) update equations. 

In the E step of the EM algorithm^ the function 
Q{Oj{t)) = E[\ogp{W,2f\0)\^,0{t)],is computed. 
This yields. 



E-=i«i(iM5«l^i(i)) 



where is the posterior probability of the event 
= 1, on observing gij^ . 



The estimate of the number of components {k) is 
chosen using a bayesian information criterion (BIC) 
criterion [6, 12]. The BIC criterion borrows from in- 
formation theory and serves to select models of low- 
est complexity to explain the data. As can be seen 
below, this complexity has two components - the first 
encodes the observed data as a function of the model 
and the second encodes the model itself. Hence, the 
BIC criterion in our setup becomes, : 



ksic = argmax^{21ogp(^|^(/c)) 



k{Np + 1) 



logn}, 
(7) 



Np is number of parameters per component in the 
k component mixture, given the number of clusters 
k ■ < 
tions. 
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In the M step: For m = 0, 1, . . . , A:, Om{t + 1) = 
argmax0^ Q{6,6{t)), for m : am{t + 1) > 0, the el- 
ements of the parameter vector estimate are 
typicahy not closed form and depend on the specific 
parametrization of the densities in in the mixture, 
i.e. p{g^'^^ \(f>m) ' If p(g*^*^ |</>m) belongs to the Gaus- 
sian density jV{iJi^^ ^m) class, we have, (j) = (/l^, 
and EM updates yield 



am{t 



En 
i=l 



VJ. 



(i) 



1) 



+ = 



(g('^-Mn^(^+l))(g 



(i) 



(8) 
(9) 



En 



(10) 



The equations 5. 6,8,9,10 are the parameter 
update equations for each of the m = 1, . . . /c cluster 
components. 

The R package 'MClust' can be used to do model 
based clustering on the genes (represented as fPCA 
scores). The method enables the selection of the op- 
timal number of clusters via maximizing the bayesian 
information criterion (BIG). It outputs a plot that 
displays the BIG for each cluster assignment, where 
the clusters can have three independent degrees of 
freedom (shape, volume and orientation). The opti- 
mal cluster assignment is chosen from all possibilities 
of shape, volume, orientation and BIG [6]. 

6. fPCA CLUSTERING RESULTS 

Based on the Bayesian Information criterion (BIG), 
we select two clusters (Fig. 4(a)) with variable 
shape, volume and orientation (VVV). Addition- 
ally, to ascertain the purity of clustering, we ex- 
amine if co-clustered genes are in the same cellu- 
lar location. This is done using the FATIGO+ tool 
at hUp://babelomics. bioinfo.cipf.es/fatigoplus/. The 
results of this analysis is in Fig. 4(b). The results in- 
dicate that only about 50% of the co-clustered genes 
are co- located (in the nucleus). This casts serious 
doubts on the analysis methods of previous papers 
that use clustering in PGA/fPGA space as a method 
to discover novel pathway components. Biologically, 
the cellular proximity of two genes is essential for 



their interaction along a pathway. Thus, unless cel- 
lular proximity can be explicitly incorporated into 
this framework, such clustering can potentially fail 
in the discovery of true pathway components [1, 11]. 



Classification 
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(a) BIC based model clustering. 
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(b) GO purity of Clusters from fPCA. 

Fig. 4. BIC based clustering for fPCA embedded data and 
GO cluster purity. 



7. COMMENTS 

To recapitulate, there is a need to find molecular 
signatures that predict grade of disease/ therapeutic 
potential. The traditional approach to find dysreg- 
ulated pathways is based on clustering genes based 
on expression profiles (for the corresponding disease) 
- in fPCA space. This uses the hypothesis that co- 
clustered (co-expressed) genes are 'possibly' part of 
the same pathway - since genes with similar expres- 
sion profiles all belong to same pathway, i.e. their 
relationship is so tight that they will behave in the 
same coordinated manner. A lot of literature using 
this hypothesis is available. 
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However, this approach is questionable because 
in any biological process, several pathways are in- 
volved, and there are cross-interactions (cross-talk) 
among pathways. Clusters can thus consist of pu- 
tatively interacting pathways, not just one pathway. 
Because different conditions/diseases are due to dif- 
ferent aberrant pathways, clusters can have different 
sets of genes and thus several 'true' pathways. This 
leads to incorrect inference of the biology, because 
every study can find a'new' pathway based on which 
condition they study. In reality, there are only a few 
standard pathways - however their interactions are 
different in different diseases/conditions. 

Thus, we would need to incorporate some other 
prior knowledge to aid the clustering approach in 
achieving biological realism. One such way would be 
to use location information in conjunction with ex- 
pression and cluster genes in that combined space - 
this follows from fact that if genes have to have coor- 
dinated pathway activity, they should be nearby in 
cellular location. 

8. Part I: BUILDING REALISM WHILE 
CLUSTERING 

As suggested in the previous sections, it would be 
useful to have a "space" which respects physical 
cellular proximities in addition to expression sim- 
ilarities. This can be enabled by considering a 
set of annotations that describe the "cellular lo- 
cation" information for each of the genes (in the 
macrophage activation program). One set of anno- 
tations that is well researched by the bioinformatics 
community is the Gene Ontology (GO) descriptors 
{http://www.geneontology.org/). This is a controlled 
hierarchical vocabulary that annotates genes in var- 
ious organisms by cellular component (CC), molecu- 
lar function (MF) and biological process (BP), based 
on literature reports. 

The next section examines the generation of a 
"semantic similarity matrix" between genes based 
on their GO (CC) descriptors, to quantify the cel- 
lular proximity among them. Just like lexical word 
ontologies for spoken languages (e.g. WordNet at 
http://wordnet.princeton.edu/), this structure im- 
poses a tree structure on the various GO terms, 
thereby expressing the similarity between any two 
terms in the ontology as a function of their parents 



in the ontology tree. 

The next step involves the use of manifold em- 
bedding techniques that can integrate such GO sim- 
ilarity along with expression- level similarity to con- 
struct an embedding of the genes as points in some 
space. One such technique is Laplacian Eigenmaps 
[2], also profiled in Section: 8.2 that approximate 
both these relationships (semantic and expression 
similarities) . This is a generalization of the principal 
component approach in that the distance measures 
on such manifolds are not necessarily euclidian. Af- 
ter embedding the genes onto this manifold, we will 
then re-examine the model based clustering approach 
and assess the GO purity (as in section: 6) of the ob- 
tained clusters. 

We remind the reader that the main goal here 
is to embed genes based on their expression pro- 
files, but additionally weighted based on their cel- 
lular proximity - this would be more biologically rel- 
evant for the discovery of true pathway activity. We 
believe that such an approach is consistent with the 
rationale of using integrative genomics or principled 
heterogeneous data integration for stronger hypoth- 
esis generation [20]. 

8.1. GO Semantic Similarity 

To quantify the notion of similarity of terms along 
an ontology, we appeal to a vast amount of litera- 
ture that addresses such questions [4] . The semantic 
similarity of any two GO terms along the ontology 
hierarchy is based on the number of shared par- 
ents and the information content of the individual 
GO terms (measures: Jiang Conrath, Resnik etc.). 
Based on the literature, we use the Jiang- Conrath 
similarity measure, given by. 



jcdist{ci,Cj) = 2log{p{lso{ci,Cj))) - [log{p{ci)) + 
log{p{cj))] 

where q and cj are two terms (nodes) along the 
GO ontology tree (i, j G {1,2,..., 168}). lso{ci, Cj) 
refers to the the information content of the last com- 
mon parent of these two nodes. The information 
content is computed based on the probabilities of ob- 
serving the individual nodes and their last common 
ancestor in an overall corpus. 
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For the 168 genes profiled in tiiis study, we use 
the R package "GOSim" to obtain the semantic sim- 
ilarity matrix (size 168 x 168) based on GO location 
annotation. This similarity matrix is used to obtain 
the weight matrix W during the Laplacian Eigenmap 
embedding procedure [2] below. 

8.2. LLE (Laplacian Eigenmaps) 

• Build the K x K, {K = 168) dimensional 
weight matrix W from the Gene Ontology 
("Cellular Component") terms of the genes 
in the dataset. This distance is the "normal- 
ized" semantic similarity alluded to above 
(section 8.1). 

• Assign weight Wi,j, from (1) for each gene 
pair (^, j), for each of the (^) gene pairs. 
Note: The higher this weight, the closer the 
genes are. 

• Find n nearest neighbors using the euclidian 
distance in principal component space. The 
scores of the functional data along the first 
two principal components can be interpreted 
as co-ordinates in a euclidian space. 

• Form the Graph Laplacian: 

T^k^hk if i = j; 
-Wij if i is connected to j; 

otherwise. 



2 ^i.jiV'i 



• Solve: miUyy^Ly = 2 j 

yj?w,^j (2), 

subject to: 

— y^Dy = 1, and 

y^Dl = 0, 

where Di^i = Wj^i, a. diagonal weight ma- 
trix. 

• Embed the co-ordinates to a lower dimen- 
sional manifold, using the solution (the 
Laplacian Eigenmap) obtained from the 
minimization above. 



values solving Ly = XDy (neglecting 
the zero eigenvalue and its eigenvector). 
— If y = [7/1 , . . . , T/t^] is the collection of 
these eigenvectors, then the embedding 
is given by : 

Vi = (^ii, . . . ,^id)^, i.e., the d dimen- 
sional representation of the i^^ data 
point (gene). 

• In our representation, we take dimensional- 
ity, d = 2 and number of neighbors, n = 5. 
The final embedding of the functional data 
based on expression and location modalities 
is shown in Fig. 5(a). 



Laplacian Eigenmap Embedding 
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(a) Eigenmap embedded data. 
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The solution to (2) is given by the 
d generalized eigenvectors associated 
with the d smallest generalized eigen- 



number of components 

(b) BIC plot for Eigenmap based clustering. 
Fig. 5. Eigenmap based embedding and BIC plot. 



May 19, 2009 22:48 WSPC/Trim Size: llin x 8.5in for Proceedings 



ws-procsllx85'new'csb2 



8.3. Results with Laplacian Eigenmaps 

After transforming the original scores (from fPCA) 
based on the GO semantic similarity matrix the 
2D representation of the gene data is shown in Fig. 
5(a). Based on this new embedding, which leads to 
a very different visualization of the data compared 
to Fig. 3(a), we can once again use model based 
clustering in the new space. 

Based on this, the BIG plot for the Eigenmap 
embedded data is shown in Fig. 5(b). The high- 
est BIG corresponds to 8 clusters (Fig. 6(a)) with 
variable shape, volume and orientation (VVV). An 
examination of the GO ( "cellular component" anno- 
tation) is shown in Fig. 6(b). This indicates that 
the GO enrichment of genes that are close by (like in 
nucleus) is much higher (^ 70% compared to 55% be- 
fore). Furthermore, interrogation of a cluster in the 
new assignment (after Eigenmap embedding) clearly 
identifies known components of the cytokine path- 
way [7] (several interleukin members). This shows 
that an embedding, that respects location as well as 
expression simultaneously, identifies closely interact- 
ing pathway components via clustering. This aids in 
the development of biologically relevant hypotheses. 
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(a) Clustering on Eigenmap embedded data. 
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(b) GO purity of Eigenmap based clustering. 

Fig. 6. BIC based clustering for Eigenmap embedded data 
and GO purity. 

9. Part II: QUERYING PATHWAY 
ACTIVITY 

The first part of this work presents a principled 
framework to embed gene relationships based on ex- 
pression and cellular location. This framework can 
now be extended to understand the co-ordinated ac- 
tivity of genes constituting a pathway. We note 
that this question has not hitherto been asked in 
this context previously. Previous methods have only 
looked at identifying pathway genes based on expres- 
sion similarities. In the light of the previous analysis 
suggesting that plain clustering in expression space 
without regard to physical proximity might not be 
biologically relevant, this framework enables the in- 
tegration of multiple modalities to obtain much more 
relevant results. Embedding the data using a princi- 
pled approach enables the formulation of more com- 
plex queries such as coordinated pathway activity. 
The key question that can now be addressed un- 
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der the above framework is: How strongly is pathway 
'X' dysregulated between normal cells and diseased 
(tuberculosis infected) cells. A pathway is a set of 
interacting genes. More generally, one can query for 
the co-ordinated activity of any subset of interesting 
genes (generalizing the approach of [19, 9]. 



(n-l) S 



'ij ^ Vij 



1<J 

Here, X = Cp and Y = Tp. Sx and Sy are 
the standard deviations from the entries of the ma- 
trices X and F, respectively (for normalization); 
n = . The higher the value of the more 

similar the two distance matrices are. 



9.1. Querying Pathway Activity 

In order to query a pathway's activity, we can 
first obtain the known components (genes) of 
the pathway using resources such as the KEGG 
{http: //www. genome, ad.jp/kegg /pathway, html) or 
BioCarta {http://cgap.nci.nih.gov/Pathways/BioCarta. 
pathway repositories. 

For any query pathway P (consisting of k 
genes), we can find the co-ordinates of this subset on 
the manifold embedding obtained above. Let Cp be 
the inter-point distance matrix {k x k) of the k-gene 
pathway in the eigenmap for the control condition. 
Let Tp be the inter-point distance matrix of the k- 
gene pathway under the perturbation (tuberculosis 
infection) . 

In this setting, the question of querying path- 
way activity translates to the following question: 
How are the gene-gene associations among these k- 
components of the pathway "different" between con- 
trol {Cp) and case {Tp). 

This is addressed in the following section. The 
idea is to find a metric of similarity (or distance) be- 
tween the distance matrices {Cp and Tp). The Mantel 
correlation test has been used in ecological studies as 
a similarity metric. The other is a hitherto unused 
metric, termed the logDet divergence, that has been 
used in other applications (to quantify the distribu- 
tional divergence between two probability distribu- 
tions). 

9.2. Quantifying difference in association 
matrices 

Based on the above, the Mantel's test is used in eco- 
logical analysis (R package "vegan" ) to compare two 
(or more) spatial proximity matrices. In our context, 
we are interested to ask if the gene set that is close 
in one space (control) is also close in the other space 
(tuberculosis) . 

The Mantel correlation coefficient between two 
{k X k) matrices X and Y - is given by [13 ]: 



Finally, we can estimate a significance of 
r(Tp, Cp) via bootstrapping. This would involve the 
following steps: 

• Repeat the following procedure B{= 1000) 
times (with index b = 1, . . . , 5): 

Pathways) 

— Generate resampled (with replace- 
ment) versions of the matrices Cp, Tp, 
denoted by C^, T^ respectively. 

- Compute the statistic 0^ = r{T^, C^). 

• Construct an empirical CDF (cumulative 
distribution function) from these boot- 
strapped sample statistics, as Fq{0) = 
P(e <0) = ^ Y.b=i ^->o(^ = 0-0^), where 
/ is an indicator random variable on its ar- 
gument X. 

• Compute the true detection statistic (on the 
original time series) = ^{Tp^ Cp) and its 
corresponding p- value (po = 1 — Te(^o)) un- 
der the empirical null distribution Fq{0). 

• If Fe(^o) ^ (1 — <^)5 then we have that 
the true mantel correlation value is signifi- 
cant at level a, leading to rejection of null- 
hypothesis (no association). 

9.3. logDet divergence 

The logDet divergence has recently received a lot of 
interest in the machine learning community, mainly 
with regard to metric learning problems [8]. To 
see an example of how they arise, consider the the 
KuUback-Liebler (KL) divergence between two mul- 
tivariate Gaussian densities p{x;Cp) and p{x\Tp). 
This is given by: 

KL{p{x',Cp)\\p{x',Tp)) = lDid{Cp,Tp), where. 

Did is the logDet divergence between two posi- 
tive definite matrices Cp and Tp defined by: 

Did{Cp,Tp) = tr{CpT-^)-logdet{CpT-^)k {k 
is the rank of the matrices Cp and Tp) 

We note that the distance matrices Cp and Tp 
are positive semi- definite, but can be made posi- 
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tive definite via the addition of a constant term to 
its non-diagonal elements [5] (Cailiez method). Be- 
fore using the logDet criterion, the distance matrices 
need to be converted into correlation matrices (this 
can be done via a scaled exponential transforma- 
tion [2]. Additionally, since the KL divergence is 
not intrinsically symmetric, the symmetrized ver- 
sion, LDdist{Tp,Cp) = \Did{Cp,Tp) + ^Did{Tp,Cp) 
can be used instead. The higher the value of this 
symmetrized distance, the higher the dissimilarity 
between Cp and Tp. 

Finally, we can estimate a significance of 
LDdist{Tp^Cp) via bootstrapping. This would in- 
volve the following steps: 

• Repeat the following procedure B{= 1000) 
times (with index h = 1, . . . , 5): 

(1) Generate resampled (with replace- 
ment) versions of the matrices Cp, T^, 
denoted by C^, respectively. 

(2) Compute the statistic 0^ = 

• Construct an empirical CDF (cumulative 
distribution function) from these boot- 
strapped sample statistics, as Fq{6) = 
P(e <0) = ^ Eti 4>o(^ = 0-0')^ where 
/ is an indicator random variable on its ar- 
gument X. 

• Compute the true detection statistic (on the 
original time series) Oq = LDdist{Tp^ Cp) and 
its corresponding p- value {po = 1 — Fq{6o)) 
under the empirical null distribution Fq{6). 

• If Fq{6o) > (1 — a), then we have that the 
true logDet value is significant at level a, 
leading to rejection of null- hypothesis (com- 
plete association). 

10. CASE STUDY:TLR PATHWAY 

As an example of querying pathway activity, we anal- 
yse the toll-like receptor (TLR) pathway. Members 
of the toll-like receptor (TLR) gene family convey 
signals stimulated by various pathogenic factors, ac- 
tivating signal transduction pathways that result in 
transcriptional regulation and stimulate innate im- 
mune function (Fig. 7(a)). Hence it is one of the 



earliest and most strongly activated pathways dur- 
ing immune response. 




(a) TLR pathway (©Biocarta). 



Histogram of Total logDet Divergence 



0.46 0.48 0.50 0.52 0.54 0.56 

Total Divergence (over 1000 Bootstrap samples) 

(b) Null Histogram of the logDet divergence for 
the TLR pathway. True value=0.5513. Is Acti- 
vated Early in Immune Response 

Fig. 7. The TLR pathway and its Divergence between nor- 
mal and case conditions. 

For this pathway, we get the subset of genes that 
are common between the Biocarta TLR catalog and 
the set of 168 genes. For those 7 genes, we can find 
the inter-point distance matrices along the normal 
and infection cases. The value (also referred to as 
the true value) of the mantel correlation and logDet 
divergence (between the normal and control states) 
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is 0.0447 and 0.5513 respectively, suggesting that the 
association between these two distance matrices is 
fairly low. Additionally, these correlation and di- 
vergence values are significant at the 0.05 level with 
respect to the null distribution (Fig. 7(b)). Hence, 
these two measures are useful at picking up a truly 
activated pathway between these two conditions. In 
the same way, one can obtain these two measures for 
any chosen pathway of interest, yielding results that 
are concordant with literature [7]. This is shown in 
Table /. 



Histogram of Total logDet Divergence 



Total Divergence (over mDa Bootstrap samples) 

(a) Null Histogram for the TLR path- 
way components. True value=0.5513. 
Is Activated Early in IR. 

Histogram of Totai iogDet Divergence 



Total Divergence (over mDa Bootstrap samples) 

(b) Null Histogram for the Mapk 
pathway components. True-value: 
0.018. Not activated early. 

Fig. 8. Bootstrapping results from Toll-like Receptor (TLR) 
and Mapk pathways. 



11. CONCLUSIONS 

In Part / of this work, we demonstrated a generaliz- 
able method to infer pathway components (or cross- 
talking pathways). Using Laplacian Eigenmaps, we 
were able to co-embed genes based on expression and 
location modalities. Model based clustering on em- 



bedded data further confirms that genes that are co- 
clustered also have higher purity with respect to cel- 
lular location. Additionally, some of the co-clustered 
genes belong to canonical pathways. 

From the 2D space obtained in Part /, we de- 
velop a novel framework (Part //) to query the activ- 
ity of any gene set (or pathway of interest) across bi- 
ological conditions using the Mantel correlation and 
logDet divergence. 

The overall contribution of this work is the devel- 
opment of a complete workflow that combines func- 
tional data analysis on expression data with ontol- 
ogy to yield biologically relevant results via hetero- 
geneous data integration. Though there has been 
some previous work [11, 1] combining gene expres- 
sion with ontology to understand gene co-regulation, 
we are aiming to do this for whole pathways or gene 
sets. 

12. EXTENSIONS AND FUTURE WORK 

The methods developed in this work, both for embed- 
ding genes based on expression and cellular location 
are applicable for any study of interest and thus eas- 
ily generalizable. Additionally, the query for the co- 
ordinated activity of any gene set of interest (path- 
way or otherwise) is also generalizable since it only 
examines the association of the inter-gene distance 
matrix (along the manifold) between case and condi- 
tion. This procedure also enables a relative ranking 
of multiple pathways, thereby allowing for simulta- 
neous queries. 

This work also expands on previous approaches 
in heterogeneous data integration, combining modal- 
ities like gene ontology with gene expression specifl- 
cally for pathway query along an biological process of 
interest. This could further enable efforts to under- 
stand pathogenesis through the modulation of path- 
way activity between the normal and diseased cell 
state. 

Finally, though this work uses the Mantel corre- 
lation and the logDet divergence for determining the 
difference in distance matrices, several other meth- 
ods such as Procrustes alignment [3], or the Jensen- 
Shannon distributional divergence can be used for 
the same purpose. It would be interesting to see 
if any of the methods make fewer distributional as- 
sumptions on the structure of the inter-point dis- 
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Table 1. Mantel and logDet values of some interesting pathways. 



Pathway Name 


Mantel test (value and significance) 


logDet divergence (value and significance) 


Apoptosis 


0.5961(0.038) 


0.149(0.025) 


Toll-like receptor (TLR) 


0.047(0.0255) 


0.5513(0.032) 


Mapk 


0.3373(0.142) 


0.018(0.010) 


T/B-cell 


0.271(0.071) 


0.1523(0.006) 



tance matrices. 



AVAILABILITY 

The source code of the analysis tools (in R 2.6) is 
available on request. The gene expression data is 
publicly available at: 
http://web.wi.mit. edu/young/. 
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